Processing Hansard Documents with Portage

Author

  • George Foster
Abstract

The House of Commons translation task involves handling documents that are transcriptions of parliamentary sessions (Hansard) or of the proceedings of various parliamentary committees. The documents are in XML, and generally contain parts where the source language is English and parts where it is French. The desired output is a set of similarly formatted XML documents containing both the original passages and their translations. This task poses several problems for automatic processing with Portage: using specialized models for different sub-genres (e.g., Finance Committee proceedings versus Hansard), running translation in different directions for different parts of a document, and preserving XML structure across translation. The last of these problems is particularly hard when XML markup occurs within a sentence, because it may be difficult or impossible to identify a corresponding marked-up segment in Portage’s output. This document describes the processing strategy that was used in the Translation Bureau trials of September 2010, and refined shortly afterwards. It covers the translation process only; it does not deal with, for instance, model training, offline adaptation, or exploiting document structure to improve translation. The processing strategy used for this task can be divided into parts that are Hansard-specific and parts that could be re-used in other settings. The rest of this document reflects this division: section 2 describes the overall strategy and how the two parts fit together; section 3 describes Hansard-specific steps; and section 4 describes general processing steps.
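
The problem of running translation in different directions within one document, while keeping the original passages next to their translations, can be illustrated with a minimal sketch. This is not the paper's pipeline and does not call Portage: the `para` element, the `lang` and `origin` attributes, and the `translate_stub` function are all hypothetical, chosen only for illustration, since the abstract does not describe the actual Hansard XML schema. Paragraphs that contain inline markup are deliberately skipped, because that is the hard case the abstract singles out.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source language to target language.
OTHER = {"en": "fr", "fr": "en"}


def translate_stub(text, src, tgt):
    """Placeholder for a real MT call; it only tags the text with the direction."""
    return f"[{src}->{tgt}] {text}"


def add_translations(xml_text, translate=translate_stub):
    """Insert a translated sibling after each plain-text <para lang="..."> element.

    The element and attribute names are assumptions made for this sketch.
    """
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so that insertions do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            src = child.get("lang")
            if child.tag != "para" or src not in OTHER:
                continue
            if len(child) > 0:
                # Inline markup inside the sentence: left untranslated here.
                continue
            tgt = OTHER[src]
            new = ET.Element("para", {"lang": tgt, "origin": "mt"})
            new.text = translate(child.text or "", src, tgt)
            new.tail = child.tail
            position = list(parent).index(child)
            parent.insert(position + 1, new)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    doc = (
        '<session><para lang="en">Hello</para>'
        '<para lang="fr">Bonjour</para></session>'
    )
    print(add_translations(doc))
```

A real system would replace `translate_stub` with a call to the translation engine and would need a separate strategy for paragraphs with inline tags.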

Similar resources

Truecasing For The Portage System

This paper presents a truecasing technique, that is, a technique for restoring the normal case form to an all-lowercased or partially cased text. The technique uses a combination of statistical components, including an N-gram language model, a case mapping model, and a specialized language model for unknown words. The system is also capable of distinguishing between “title” and “non-title” lines...

Experiences with Parallelisation of an Existing NLP Pipeline: Tagging Hansard

This poster describes experiences processing the two-billion-word Hansard corpus using a fairly standard NLP pipeline on a high performance cluster. Herein we report how we were able to parallelise and apply a “traditional” single-threaded batch-oriented application to a platform that differs greatly from that for which it was originally designed. We start by discussing the tagging toolchain, i...

Translating Structured Documents

Machine Translation traditionally treats documents as sets of independent sentences. In many genres, however, documents are highly structured, and their structure contains information that can be used to improve translation quality. We present a preliminary approach to document translation that uses structural features to modify the behaviour of a language model, at sentence-level granularity. ...

Portage and Path Dependence.

We examine portage sites in the U.S. South, Mid-Atlantic, and Midwest, including those on the fall line, a geomorphological feature in the southeastern U.S. marking the final rapids on rivers before the ocean. Historically, waterborne transport of goods required portage around the falls at these points, while some falls provided water power during early industrialization. These factors attracte...

Publication date: 2010